
The dataset needs to be explored to identify differences between the customers of each product and to find the relationships between the different customer attributes. The features of the dataset also have to be examined to come up with insights relevant to the business. Python will be used for all of this analysis.
The data describes customers of the treadmill product(s) of a retail store called Cardio Good Fitness. It contains the following variables: Product, Age, Gender, Education, MaritalStatus, Usage, Fitness, Income and Miles.
Importing necessary libraries
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# to restrict the float value to 3 decimal places
pd.set_option('display.float_format', lambda x: '%.3f' % x)
Importing the dataset
# Read the CSV file and store it in the Dataframe
cardio_fit_data= pd.read_csv('CardioGoodFitness.csv')
The first and last 5 rows of the dataset
cardio_fit_data.head()
| | Product | Age | Gender | Education | MaritalStatus | Usage | Fitness | Income | Miles |
|---|---|---|---|---|---|---|---|---|---|
| 0 | TM195 | 18 | Male | 14 | Single | 3 | 4 | 29562 | 112 |
| 1 | TM195 | 19 | Male | 15 | Single | 2 | 3 | 31836 | 75 |
| 2 | TM195 | 19 | Female | 14 | Partnered | 4 | 3 | 30699 | 66 |
| 3 | TM195 | 19 | Male | 12 | Single | 3 | 3 | 32973 | 85 |
| 4 | TM195 | 20 | Male | 13 | Partnered | 4 | 2 | 35247 | 47 |
cardio_fit_data.tail()
| | Product | Age | Gender | Education | MaritalStatus | Usage | Fitness | Income | Miles |
|---|---|---|---|---|---|---|---|---|---|
| 175 | TM798 | 40 | Male | 21 | Single | 6 | 5 | 83416 | 200 |
| 176 | TM798 | 42 | Male | 18 | Single | 5 | 4 | 89641 | 200 |
| 177 | TM798 | 45 | Male | 16 | Single | 5 | 5 | 90886 | 160 |
| 178 | TM798 | 47 | Male | 18 | Partnered | 4 | 5 | 104581 | 120 |
| 179 | TM798 | 48 | Male | 18 | Partnered | 4 | 5 | 95508 | 180 |
The shape of the dataset
# checking shape of the data
print("There are", cardio_fit_data.shape[0], 'rows and', cardio_fit_data.shape[1], "columns.")
There are 180 rows and 9 columns.
Checking whether the data contains any missing values or duplicate rows is very important
# checking missing values
cardio_fit_data.isnull().sum()
Product 0
Age 0
Gender 0
Education 0
MaritalStatus 0
Usage 0
Fitness 0
Income 0
Miles 0
dtype: int64
# Checking duplicate rows
cardio_fit_data.duplicated().sum()
0
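The two checks above come back clean for this dataset, so it may help to see what they report when problems do exist. The following is a minimal sketch on a made-up four-row frame (the `toy` data is an assumption for illustration, not part of CardioGoodFitness.csv).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing Age and repeated rows
toy = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM498", "TM195"],
    "Age": [18, 18, np.nan, 18],
})

missing_per_column = toy.isnull().sum()   # count of NaN values in each column
n_duplicates = toy.duplicated().sum()     # rows identical to an earlier row

# One possible clean-up: drop repeated rows, then rows with missing values
cleaned = toy.drop_duplicates().dropna()

print(missing_per_column)
print("duplicate rows:", n_duplicates)
print(cleaned)
```

Whether to drop or impute such rows depends on the business problem; dropping is only one option.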
Check the data types of the columns for the dataset
cardio_fit_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Product 180 non-null object
1 Age 180 non-null int64
2 Gender 180 non-null object
3 Education 180 non-null int64
4 MaritalStatus 180 non-null object
5 Usage 180 non-null int64
6 Fitness 180 non-null int64
7 Income 180 non-null int64
8 Miles 180 non-null int64
dtypes: int64(6), object(3)
memory usage: 12.8+ KB
Let's check the count and percentage of categorical levels in each column
# Making a list of all categorical variables
cat_cols = ['Product', 'Gender', 'MaritalStatus']
# Printing the count of unique categorical levels in each column
for column in cat_cols:
    print(cardio_fit_data[column].value_counts())
    print("-" * 50)
TM195 80
TM498 60
TM798 40
Name: Product, dtype: int64
--------------------------------------------------
Male 104
Female 76
Name: Gender, dtype: int64
--------------------------------------------------
Partnered 107
Single 73
Name: MaritalStatus, dtype: int64
--------------------------------------------------
# Printing the percentage of unique categorical levels in each column
for column in cat_cols:
    print(cardio_fit_data[column].value_counts(normalize=True))
    print("-" * 50)
TM195 0.444
TM498 0.333
TM798 0.222
Name: Product, dtype: float64
--------------------------------------------------
Male 0.578
Female 0.422
Name: Gender, dtype: float64
--------------------------------------------------
Partnered 0.594
Single 0.406
Name: MaritalStatus, dtype: float64
--------------------------------------------------
Observations
Checking the statistical summary of the data.
cardio_fit_data.describe(include='all').T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product | 180 | 3 | TM195 | 80 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Age | 180.000 | NaN | NaN | NaN | 28.789 | 6.943 | 18.000 | 24.000 | 26.000 | 33.000 | 50.000 |
| Gender | 180 | 2 | Male | 104 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Education | 180.000 | NaN | NaN | NaN | 15.572 | 1.617 | 12.000 | 14.000 | 16.000 | 16.000 | 21.000 |
| MaritalStatus | 180 | 2 | Partnered | 107 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Usage | 180.000 | NaN | NaN | NaN | 3.456 | 1.085 | 2.000 | 3.000 | 3.000 | 4.000 | 7.000 |
| Fitness | 180.000 | NaN | NaN | NaN | 3.311 | 0.959 | 1.000 | 3.000 | 3.000 | 4.000 | 5.000 |
| Income | 180.000 | NaN | NaN | NaN | 53719.578 | 16506.684 | 29562.000 | 44058.750 | 50596.500 | 58668.000 | 104581.000 |
| Miles | 180.000 | NaN | NaN | NaN | 103.194 | 51.864 | 21.000 | 66.000 | 94.000 | 114.750 | 360.000 |
Age: Customers are around 29 years old on average, with a minimum of 18 years and a maximum of 50 years.
Education: The minimum is 12 and the average is about 15.6, so every customer appears to have at least completed high school (assuming the value counts years of education).
Usage: This number represents the average number of times the customer wants to use the treadmill every week. The 75th percentile (4) and the maximum (7) both look plausible.
Fitness: This is the self-rated fitness score, with a minimum of 1 and a maximum of 5. This data also looks valid.
Income: 75% of the customers earn below about 59k, but the maximum value is 104581, so a few customers have much higher incomes.
Miles: As with income, 75% of the customers run fewer than 115 miles, but a few run considerably more, up to 360.
Let's check the distribution for numerical columns.
# Defining the function for creating a boxplot and histogram
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize)  # creating the 2 subplots
    # Boxplot; the mean of the column is drawn as a triangle marker
    sns.boxplot(data=data, x=feature, ax=ax_box2, showmeans=True, color="mediumturquoise")
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, color="mediumpurple")
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, color="mediumpurple")  # For histogram
    ax_hist2.axvline(data[feature].mean(), color="green", linestyle="--")  # Add mean to the histogram
    ax_hist2.axvline(data[feature].median(), color="black", linestyle="-")  # Add median to the histogram
Observations on Age
histogram_boxplot(cardio_fit_data,'Age')
Observations on Education
sns.boxplot(data=cardio_fit_data,x='Education')
plt.show()
sns.displot(data=cardio_fit_data,x='Education',kind='kde')
plt.show()
Skewness is a measure of asymmetry of a distribution.
· In a normal distribution, the mean and the median coincide, the curve is symmetric about them, and the value of skewness is zero.
· When a distribution is asymmetrical, the tail of the distribution is skewed to one side: to the right or to the left.
· When the value of the skewness is negative, the tail of the distribution is longer towards the left hand side of the curve.
· When the value of the skewness is positive, the tail of the distribution is longer towards the right hand side of the curve.
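The sign convention described above can be checked on a few small samples. This is a minimal sketch; the three series are made-up values chosen for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy samples (assumed values, for illustration only)
symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 20])  # one large value stretches the right tail
left_tailed = -right_tailed                       # mirror image: long tail on the left

skew_sym = symmetric.skew()
skew_right = right_tailed.skew()
skew_left = left_tailed.skew()

print("symmetric:", skew_sym)       # zero, up to floating-point error
print("right tail:", skew_right)    # positive
print("left tail:", skew_left)      # negative

# In the right-tailed sample the mean is pulled above the median
print(right_tailed.mean(), right_tailed.median())
```

The same pattern (mean above median, positive skewness) is what a right-skewed column in the real data would show.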

cardio_fit_data.skew(axis = 0, skipna = True,numeric_only=True)
Age 0.982
Education 0.622
Usage 0.739
Fitness 0.455
Income 1.292
Miles 1.724
dtype: float64
· Kurtosis is one of the two measures that quantify the shape of a distribution (the other is skewness). It reflects how heavy the tails are, and therefore how prone the data is to outliers.
· Kurtosis is often described as the peakedness of the distribution.
If the distribution is tall and thin it is called a leptokurtic distribution (Kurtosis > 3). Values in a leptokurtic distribution are near the mean or at the extremes.
A flat distribution where the values are moderately spread out (i.e., unlike leptokurtic) is called a platykurtic (Kurtosis < 3) distribution.
A distribution whose shape is in between a leptokurtic distribution and a platykurtic distribution is called a mesokurtic (Kurtosis = 3) distribution. A mesokurtic distribution looks close to a normal distribution.
· Kurtosis is sometimes reported as “excess kurtosis.” Excess kurtosis is determined by subtracting 3 from the kurtosis. This makes the normal distribution kurtosis equal 0.
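One practical point when reading pandas output: `kurt()` uses Fisher's definition, so it returns excess kurtosis and its values should be compared with 0 rather than 3. The sketch below illustrates this on a simulated normal sample (the sample, seed and variable names are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Simulated standard-normal sample (seeded so the run is repeatable)
rng = np.random.default_rng(0)
normal_sample = pd.Series(rng.normal(size=100_000))

excess_kurtosis = normal_sample.kurt()   # pandas reports excess kurtosis
plain_kurtosis = excess_kurtosis + 3     # add 3 back to use the 3-based thresholds

print("excess kurtosis:", round(excess_kurtosis, 3))  # close to 0 for a normal sample
print("kurtosis:", round(plain_kurtosis, 3))          # close to 3 for a normal sample
```

Under this reading, a positive value from `kurt()` points to heavier tails than a normal distribution and a negative value to lighter tails.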

cardio_fit_data.kurt(axis = 0, skipna = True,numeric_only=True)
Age 0.410
Education 1.033
Usage 0.543
Fitness -0.369
Income 1.374
Miles 4.321
dtype: float64
cardio_fit_data.loc[cardio_fit_data['Education']>18]
| | Product | Age | Gender | Education | MaritalStatus | Usage | Fitness | Income | Miles |
|---|---|---|---|---|---|---|---|---|---|
| 156 | TM798 | 25 | Male | 20 | Partnered | 4 | 5 | 74701 | 170 |
| 157 | TM798 | 26 | Female | 21 | Single | 4 | 3 | 69721 | 100 |
| 161 | TM798 | 27 | Male | 21 | Partnered | 4 | 4 | 90886 | 100 |
| 175 | TM798 | 40 | Male | 21 | Single | 6 | 5 | 83416 | 200 |
Observations on Usage
histogram_boxplot(cardio_fit_data,'Usage')
sns.displot(data=cardio_fit_data,x='Usage',kind='kde')
plt.show()
Observations on Fitness
histogram_boxplot(cardio_fit_data,'Fitness')
Observations on Income
histogram_boxplot(cardio_fit_data,'Income')
Observations on Miles
histogram_boxplot(cardio_fit_data,'Miles')
cardio_fit_data.loc[cardio_fit_data['Miles']>200].shape
(6, 9)
# finding the gender split of the customers who run more than 200 miles
cardio_fit_data.loc[cardio_fit_data['Miles']>200,'Gender'].value_counts()
Male 4 Female 2 Name: Gender, dtype: int64
Let's explore the categorical variables now
sns.countplot(data=cardio_fit_data,x='Product');
sns.countplot(data=cardio_fit_data,x='Gender');
sns.countplot(data=cardio_fit_data,x='MaritalStatus');
From the above charts, it is easy to see which level of each categorical variable has more customers than the others.
Multivariate data analysis refers to all statistical methods that simultaneously analyze multiple measurements on each individual respondent or object under investigation.
This method is used principally for four reasons, i.e. to see patterns of data, to make clear comparisons, to discard unwanted information and to study multiple factors at once.
cardio_fit_data.cov(numeric_only=True)
| | Age | Education | Usage | Fitness | Income | Miles |
|---|---|---|---|---|---|---|
| Age | 48.212 | 3.149 | 0.113 | 0.407 | 58844.463 | 13.187 |
| Education | 3.149 | 2.615 | 0.693 | 0.637 | 16704.718 | 25.771 |
| Usage | 0.113 | 0.693 | 1.177 | 0.695 | 9303.043 | 42.710 |
| Fitness | 0.407 | 0.637 | 0.695 | 0.919 | 8467.925 | 39.073 |
| Income | 58844.463 | 16704.718 | 9303.043 | 8467.925 | 272470624.145 | 465265.362 |
| Miles | 13.187 | 25.771 | 42.710 | 39.073 | 465265.362 | 2689.833 |
cardio_fit_data.corr(numeric_only=True)
| | Age | Education | Usage | Fitness | Income | Miles |
|---|---|---|---|---|---|---|
| Age | 1.000 | 0.280 | 0.015 | 0.061 | 0.513 | 0.037 |
| Education | 0.280 | 1.000 | 0.395 | 0.411 | 0.626 | 0.307 |
| Usage | 0.015 | 0.395 | 1.000 | 0.669 | 0.520 | 0.759 |
| Fitness | 0.061 | 0.411 | 0.669 | 1.000 | 0.535 | 0.786 |
| Income | 0.513 | 0.626 | 0.520 | 0.535 | 1.000 | 0.543 |
| Miles | 0.037 | 0.307 | 0.759 | 0.786 | 0.543 | 1.000 |
# lets check the correlation between two variables- Age & Fitness
cardio_fit_data[['Age','Fitness']].corr()
| | Age | Fitness |
|---|---|---|
| Age | 1.000 | 0.061 |
| Fitness | 0.061 | 1.000 |
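The covariance and correlation tables are two views of the same information: a correlation is a covariance divided by the product of the two standard deviations. As a quick worked check, the sketch below recomputes the Age and Fitness correlation from the rounded entries of the covariance table (the variable names are chosen here for illustration).

```python
import math

# Rounded entries from the covariance table above
cov_age_fitness = 0.407
var_age = 48.212       # Cov(Age, Age) is the variance of Age
var_fitness = 0.919    # Cov(Fitness, Fitness) is the variance of Fitness

# Pearson correlation = covariance / (std of Age * std of Fitness)
corr_age_fitness = cov_age_fitness / math.sqrt(var_age * var_fitness)

print(round(corr_age_fitness, 3))  # 0.061, matching the correlation table
```

This is why correlation is easier to compare across pairs: it is unit-free, while the covariance entries involving Income are large mostly because of Income's scale.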
plt.figure(figsize=(10,5))
sns.heatmap(cardio_fit_data.corr(numeric_only=True),annot=True,cmap='Spectral',vmin=-1,vmax=1)
plt.show()
Observations
From the above chart it is clear that no pair of variables is negatively correlated.
Fitness is highly correlated with Usage and Miles, which makes sense.
Age is weakly related to Education and moderately related to Income, but it is essentially unrelated to Fitness, Usage and Miles.
A heat map is very useful for comparing all the variables at a glance.
#num_var = ['Age','Education','Usage','Fitness','Income', 'Miles']
#sns.pairplot(data=cardio_fit_data[num_var], diag_kind="kde")
sns.pairplot(data=cardio_fit_data, kind="reg")
plt.show()
Observations
Pair plot is very helpful to see how each column/feature is distributed.
Also it helps to find the best fit regression line between the variables.
Age, Income and Miles are right skewed and Education, Usage and Fitness are almost normally distributed.
plt.figure(figsize=(10,5))
sns.scatterplot(data=cardio_fit_data,x='Product',y='Miles',hue='Gender')
plt.show()
sns.lineplot(data=cardio_fit_data, x='Usage', y ='Fitness',ci=None)
plt.show()
plt.figure(figsize=(15,7))
sns.lineplot(data=cardio_fit_data, x='Miles', y ='Fitness',ci=None)
plt.show()
sns.lmplot(y = 'Fitness', x = 'Miles', hue = 'MaritalStatus', data = cardio_fit_data);
sns.barplot(data = cardio_fit_data, x= 'Product', y='Usage', hue ='Gender');
sns.barplot(data = cardio_fit_data, x= 'Product', y='Miles', hue ='MaritalStatus')
plt.xticks(rotation=90);
sns.barplot(data = cardio_fit_data, x= 'Product', y='Usage', hue ='Fitness');
sns.barplot(data = cardio_fit_data, x= 'Product', y='Income', hue ='Fitness')
plt.xticks(rotation=90);
# importing plotly
import plotly.express as px
# Creating a bar chart using plotly to show the usage of each product
fig = px.bar(cardio_fit_data, x="Product", y="Usage",
title ="Usage of each Product",
width = 800, height = 400,
template="simple_white")
fig.show(renderer='notebook')
#fig = px.bar(cardio_fit_data, x="Product", y="Miles", color="Fitness", barmode="group", facet_col="MaritalStatus")
fig = px.bar(cardio_fit_data, x="Product", y="Miles", color="Fitness", facet_col="MaritalStatus")
fig.show(renderer='notebook')
fig = px.scatter(cardio_fit_data, x="Product", y="Miles", color="Gender", symbol="Fitness", facet_col="MaritalStatus")
fig.show(renderer='notebook')
fig = px.scatter(cardio_fit_data, x="Product", y="Miles", color="Gender", symbol="Usage", facet_col="MaritalStatus")
fig.show(renderer='notebook')
We can also separate the dataset as shown below in case we want to dive deeper into it
Male = cardio_fit_data['Gender'] == 'Male'
maledata= cardio_fit_data[Male]
Female = cardio_fit_data['Gender'] == 'Female'
femaledata= cardio_fit_data[Female]
single = cardio_fit_data['MaritalStatus'] == 'Single'
singledata= cardio_fit_data[single]
partnered = cardio_fit_data['MaritalStatus'] == 'Partnered'
partnereddata= cardio_fit_data[partnered]
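Boolean masks like the ones above work well for a handful of subsets. When every level of a column is needed, `groupby` is a common alternative. The following is a sketch on a made-up four-row frame (the `toy` data and variable names are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy frame reusing three of the dataset's column names
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "MaritalStatus": ["Single", "Partnered", "Partnered", "Single"],
    "Miles": [100, 80, 120, 60],
})

# One sub-frame per gender, keyed by the level name
subsets = {name: group for name, group in toy.groupby("Gender")}
male_toy = subsets["Male"]

# Summaries per group without building the sub-frames by hand
mean_miles = toy.groupby("Gender")["Miles"].mean()

print(male_toy)
print(mean_miles)
```

The mask approach is easier to read for one-off filters; `groupby` scales better when the number of levels grows.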
#creates count plot for the number of single male and female customers that bought the different products.
plt.figure(figsize=(10, 5)) #change figure size
sns.countplot(x='Gender',hue = 'Product', data= singledata)
plt.title("Number of single customers with respect to gender and product bought", fontsize = 16) #titles the graph.
plt.ylabel('Number of customers', fontsize = 12) #changes y axis label.
plt.xlabel('Gender', fontsize = 12) #changes x axis label.
plt.show() #shows graph.
#creates a stacked histogram of the ages of male customers, split by the product they bought.
sns.displot(maledata, x='Age', hue = 'Product', multiple = 'stack')
plt.title("Distribution of male customers age with respect to product") #titles the graph.
plt.ylabel('Number of customers') #changes y axis label.
plt.xlabel('Age (years)') #changes x axis label.
plt.show() #shows graph.
sns.relplot(data=cardio_fit_data,x='Usage',y='Fitness',col='MaritalStatus',kind='line', ci=None, col_wrap=4)
plt.show()
# double click on the plot to zoom in
sns.relplot(data=cardio_fit_data,x='Usage',y='Fitness',col='Gender',kind='line', ci=None, col_wrap=4)
plt.show()
sns.relplot(data=cardio_fit_data,x='Income',y='Fitness',col='MaritalStatus',kind='line', ci=None, col_wrap=4)
plt.show()
# double click on the plot to zoom in
sns.catplot(x='Education', y='Fitness', data=cardio_fit_data, kind="bar", hue='Gender')
plt.show()
sns.catplot(x='Product', y='Education', data=cardio_fit_data, kind="bar", hue='Gender')
plt.show()
sns.catplot(x='Usage', y='Fitness', data=cardio_fit_data, kind="bar", hue='MaritalStatus')
plt.show()
sns.catplot(x="Fitness", data = cardio_fit_data, col="Gender", kind = "count");
plt.figure(figsize=(15,7))
palette = sns.color_palette("mako_r", 6)
sns.lineplot(data=cardio_fit_data, x="Miles", y="Income", hue='Fitness', style="Gender", palette="pastel", ci=False,)
plt.ylabel('Income of Customer')
plt.xlabel('Expected Miles to run')
plt.show()
The graphs below will help us understand the data in terms of the products
plt.figure(figsize=(15,7))
sns.lineplot(data=cardio_fit_data, x="Miles", y="Usage", hue='Product', estimator='sum', ci=False)
plt.ylabel('Total usage (times per week)')
plt.xlabel('Expected Miles to run')
plt.show()
plt.figure(figsize=(10,5))
sns.boxplot(data=cardio_fit_data,x='Product',y='Usage',showfliers=False) # turning off outliers
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(10,5))
#sns.boxplot(data=cardio_fit_data,x='Product',y='Fitness',showfliers=False) # turning off outliers
sns.boxplot(data=cardio_fit_data,x='Product',y='Fitness')
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(10,5))
sns.boxplot(data=cardio_fit_data,x='Product',y='Miles',showfliers=False) # turning off outliers
plt.xticks(rotation=90)
plt.show()
# Fitness measure for every product
sns.catplot(x='Fitness',
col='Product',
data=cardio_fit_data,
col_wrap=4,
kind="violin")
plt.show()
# Usage measure for every product
sns.catplot(x='Usage',
col='Product',
data=cardio_fit_data,
col_wrap=4,
kind="violin")
plt.show()
Creating bins for Age column
We will group Age into three categories:
· 18-30 Category.
· 30+ Category.
· 40+ Category.
We will use the pd.cut() function to create the bins in the Age column.
Syntax: pd.cut(x, bins, right=True, labels=None)
x - column/array to be binned
bins - number of bins to create, or a list of bin edges
labels - specifies the labels for the bins
right - whether each bin includes its rightmost edge (default True); if set to False, the rightmost edge of the interval is excluded
# using pd.cut() function to create bins
cardio_fit_data['Age_Category'] = pd.cut(cardio_fit_data['Age'],bins=[18,30,40,52],labels=['18-30','30+','40+'], right = False)
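To see how `right=False` treats the bin edges, the same bins and labels can be applied to a few boundary ages. This is a small sketch; the `ages` series is made up for illustration.

```python
import pandas as pd

# Boundary ages chosen to sit on or next to the bin edges
ages = pd.Series([18, 29, 30, 39, 40, 50])
bins = [18, 30, 40, 52]
labels = ["18-30", "30+", "40+"]

# right=False gives the intervals [18, 30), [30, 40), [40, 52)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)
print(left_closed.tolist())
# ['18-30', '18-30', '30+', '30+', '40+', '40+']

# With the default right=True the intervals are (18, 30], (30, 40], (40, 52],
# so the youngest age, 18, falls outside every bin and becomes NaN
right_closed = pd.cut(ages, bins=bins, labels=labels)
print(right_closed.isna().sum())  # 1
```

With `right=False`, an age of exactly 30 lands in the '30+' category, so the '18-30' label effectively covers ages 18 to 29.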
sns.histplot(data=cardio_fit_data,x='Age_Category',stat='density')
plt.show()
cardio_fit_data['Age_Category'].unique()
['18-30', '30+', '40+'] Categories (3, object): ['18-30' < '30+' < '40+']
cardio_fit_data.head(10)
| | Product | Age | Gender | Education | MaritalStatus | Usage | Fitness | Income | Miles | Age_Category |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TM195 | 18 | Male | 14 | Single | 3 | 4 | 29562 | 112 | 18-30 |
| 1 | TM195 | 19 | Male | 15 | Single | 2 | 3 | 31836 | 75 | 18-30 |
| 2 | TM195 | 19 | Female | 14 | Partnered | 4 | 3 | 30699 | 66 | 18-30 |
| 3 | TM195 | 19 | Male | 12 | Single | 3 | 3 | 32973 | 85 | 18-30 |
| 4 | TM195 | 20 | Male | 13 | Partnered | 4 | 2 | 35247 | 47 | 18-30 |
| 5 | TM195 | 20 | Female | 14 | Partnered | 3 | 3 | 32973 | 66 | 18-30 |
| 6 | TM195 | 21 | Female | 14 | Partnered | 3 | 3 | 35247 | 75 | 18-30 |
| 7 | TM195 | 21 | Male | 13 | Single | 3 | 3 | 32973 | 85 | 18-30 |
| 8 | TM195 | 21 | Male | 15 | Single | 5 | 4 | 35247 | 141 | 18-30 |
| 9 | TM195 | 21 | Female | 15 | Partnered | 2 | 3 | 37521 | 85 | 18-30 |
cardio_fit_data.tail(10)
| | Product | Age | Gender | Education | MaritalStatus | Usage | Fitness | Income | Miles | Age_Category |
|---|---|---|---|---|---|---|---|---|---|---|
| 170 | TM798 | 31 | Male | 16 | Partnered | 6 | 5 | 89641 | 260 | 30+ |
| 171 | TM798 | 33 | Female | 18 | Partnered | 4 | 5 | 95866 | 200 | 30+ |
| 172 | TM798 | 34 | Male | 16 | Single | 5 | 5 | 92131 | 150 | 30+ |
| 173 | TM798 | 35 | Male | 16 | Partnered | 4 | 5 | 92131 | 360 | 30+ |
| 174 | TM798 | 38 | Male | 18 | Partnered | 5 | 5 | 104581 | 150 | 30+ |
| 175 | TM798 | 40 | Male | 21 | Single | 6 | 5 | 83416 | 200 | 40+ |
| 176 | TM798 | 42 | Male | 18 | Single | 5 | 4 | 89641 | 200 | 40+ |
| 177 | TM798 | 45 | Male | 16 | Single | 5 | 5 | 90886 | 160 | 40+ |
| 178 | TM798 | 47 | Male | 18 | Partnered | 4 | 5 | 104581 | 120 | 40+ |
| 179 | TM798 | 48 | Male | 18 | Partnered | 4 | 5 | 95508 | 180 | 40+ |
An outlier is a data point that is abnormally/unrealistically distant from other points in the data.
The challenge with outlier detection is determining if a point is truly a problem or simply a large value. If a point is genuine then it is very important to keep it in the data as otherwise we're removing the most interesting data points.
It is left to the best judgement of the investigator to decide whether treating outliers is necessary and how to go about it. Domain Knowledge and impact of the business problem tend to drive this decision.
Handling outliers
Some of the commonly used methods to deal with the data points that we actually flag as outliers are:
So, it is often a good idea to examine the results by running an analysis with and without outliers.
Visualization of all the outliers present in data
# outlier detection using boxplot
# selecting the numerical columns of data and adding their names in a list
numeric_columns = ['Age','Education','Usage','Fitness','Income', 'Miles']
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
    plt.subplot(3, 3, i + 1)
    plt.boxplot(cardio_fit_data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Let's find the percentage of outliers, in each column of the data, using IQR.
Treating outliers
We will cap/clip the minimum and maximum value of these columns to the lower and upper whisker value of the boxplot found using Q1 - 1.5*IQR and Q3 + 1.5*IQR, respectively.
Note: Generally, a value of 1.5 * IQR is taken to cap the values of outliers to upper and lower whiskers but any number (example 0.5, 2, 3, etc) other than 1.5 can be chosen. The value depends upon the business problem statement.
# Finding the 25th percentile and 75th percentile for the numerical columns.
Q1 = cardio_fit_data[numeric_columns].quantile(0.25)
Q3 = cardio_fit_data[numeric_columns].quantile(0.75)
IQR = Q3 - Q1 #Inter Quantile Range (75th percentile - 25th percentile)
lower_whisker = Q1 - 1.5*IQR #Finding lower and upper bounds for all values. All values outside these bounds are outliers
upper_whisker = Q3 + 1.5*IQR
# Percentage of outliers in each column
((cardio_fit_data[numeric_columns] < lower_whisker) | (cardio_fit_data[numeric_columns] > upper_whisker)).sum()/cardio_fit_data.shape[0]*100
Age 2.778
Education 2.222
Usage 5.000
Fitness 1.111
Income 10.556
Miles 7.222
dtype: float64
Q1 = cardio_fit_data['Miles'].quantile(0.25) # 25th quantile
Q3 = cardio_fit_data['Miles'].quantile(0.75) # 75th quantile
IQR = Q3 - Q1 # Interquartile range (75th percentile - 25th percentile)
lower_whisker = Q1 - 1.5 * IQR
upper_whisker = Q3 + 1.5 * IQR
print(lower_whisker)
print(upper_whisker)
-7.125 187.875
cardio_fit_data.loc[cardio_fit_data['Miles']>188].sort_values('Miles',ascending=False)
| | Product | Age | Gender | Education | MaritalStatus | Usage | Fitness | Income | Miles | Age_Category |
|---|---|---|---|---|---|---|---|---|---|---|
| 173 | TM798 | 35 | Male | 16 | Partnered | 4 | 5 | 92131 | 360 | 30+ |
| 166 | TM798 | 29 | Male | 14 | Partnered | 7 | 5 | 85906 | 300 | 18-30 |
| 167 | TM798 | 30 | Female | 16 | Partnered | 6 | 5 | 90886 | 280 | 30+ |
| 170 | TM798 | 31 | Male | 16 | Partnered | 6 | 5 | 89641 | 260 | 30+ |
| 155 | TM798 | 25 | Male | 18 | Partnered | 6 | 5 | 75946 | 240 | 18-30 |
| 84 | TM498 | 21 | Female | 14 | Partnered | 5 | 4 | 34110 | 212 | 18-30 |
| 142 | TM798 | 22 | Male | 18 | Single | 4 | 5 | 48556 | 200 | 18-30 |
| 148 | TM798 | 24 | Female | 16 | Single | 5 | 5 | 52291 | 200 | 18-30 |
| 152 | TM798 | 25 | Female | 18 | Partnered | 5 | 5 | 61006 | 200 | 18-30 |
| 171 | TM798 | 33 | Female | 18 | Partnered | 4 | 5 | 95866 | 200 | 30+ |
| 175 | TM798 | 40 | Male | 21 | Single | 6 | 5 | 83416 | 200 | 40+ |
| 176 | TM798 | 42 | Male | 18 | Single | 5 | 4 | 89641 | 200 | 40+ |
#Treating outliers
#cardio_fit_data['Miles'] = np.clip(cardio_fit_data['Miles'], lower_whisker, upper_whisker)
#sns.boxplot(data=cardio_fit_data,x='Miles')
#plt.show()
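The capping step is commented out above because the flagged values look genuine. To see what it would do, the sketch below applies the same whisker rule to a short made-up series (the `miles` values are assumptions for illustration) and uses `Series.clip`, the pandas counterpart of `np.clip`.

```python
import pandas as pd

# Toy values with one unusually large entry (not the real Miles column)
miles = pd.Series([47, 66, 75, 85, 94, 112, 141, 360])

q1 = miles.quantile(0.25)
q3 = miles.quantile(0.75)
iqr = q3 - q1
lower_whisker = q1 - 1.5 * iqr
upper_whisker = q3 + 1.5 * iqr

# Count the points outside the whiskers, then cap them to the whisker values
n_outliers = ((miles < lower_whisker) | (miles > upper_whisker)).sum()
capped = miles.clip(lower=lower_whisker, upper=upper_whisker)

print(lower_whisker, upper_whisker)  # 3.0 189.0
print(n_outliers)                    # 1
print(capped.tolist())
```

Capping keeps the row but replaces the extreme value with the whisker value, so it is worth comparing results with and without this step before adopting it.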
Q1 = cardio_fit_data['Income'].quantile(0.25) # 25th quantile
Q3 = cardio_fit_data['Income'].quantile(0.75) # 75th quantile
IQR = Q3 - Q1 # Interquartile range (75th percentile - 25th percentile)
lower_whisker = Q1 - 1.5 * IQR
upper_whisker = Q3 + 1.5 * IQR
print(lower_whisker)
print(upper_whisker)
22144.875 80581.875
cardio_fit_data.loc[cardio_fit_data['Income']>80581].sort_values('Income',ascending=False)
| | Product | Age | Gender | Education | MaritalStatus | Usage | Fitness | Income | Miles | Age_Category |
|---|---|---|---|---|---|---|---|---|---|---|
| 174 | TM798 | 38 | Male | 18 | Partnered | 5 | 5 | 104581 | 150 | 30+ |
| 178 | TM798 | 47 | Male | 18 | Partnered | 4 | 5 | 104581 | 120 | 40+ |
| 168 | TM798 | 30 | Male | 18 | Partnered | 5 | 4 | 103336 | 160 | 30+ |
| 169 | TM798 | 30 | Male | 18 | Partnered | 5 | 5 | 99601 | 150 | 30+ |
| 171 | TM798 | 33 | Female | 18 | Partnered | 4 | 5 | 95866 | 200 | 30+ |
| 179 | TM798 | 48 | Male | 18 | Partnered | 4 | 5 | 95508 | 180 | 40+ |
| 162 | TM798 | 28 | Female | 18 | Partnered | 6 | 5 | 92131 | 180 | 18-30 |
| 172 | TM798 | 34 | Male | 16 | Single | 5 | 5 | 92131 | 150 | 30+ |
| 173 | TM798 | 35 | Male | 16 | Partnered | 4 | 5 | 92131 | 360 | 30+ |
| 161 | TM798 | 27 | Male | 21 | Partnered | 4 | 4 | 90886 | 100 | 18-30 |
| 177 | TM798 | 45 | Male | 16 | Single | 5 | 5 | 90886 | 160 | 40+ |
| 167 | TM798 | 30 | Female | 16 | Partnered | 6 | 5 | 90886 | 280 | 30+ |
| 176 | TM798 | 42 | Male | 18 | Single | 5 | 4 | 89641 | 200 | 40+ |
| 170 | TM798 | 31 | Male | 16 | Partnered | 6 | 5 | 89641 | 260 | 30+ |
| 160 | TM798 | 27 | Male | 18 | Single | 4 | 3 | 88396 | 100 | 18-30 |
| 164 | TM798 | 28 | Male | 18 | Single | 6 | 5 | 88396 | 150 | 18-30 |
| 166 | TM798 | 29 | Male | 14 | Partnered | 7 | 5 | 85906 | 300 | 18-30 |
| 175 | TM798 | 40 | Male | 21 | Single | 6 | 5 | 83416 | 200 | 40+ |
| 159 | TM798 | 27 | Male | 16 | Partnered | 4 | 5 | 83416 | 160 | 18-30 |
#Treating outliers
#cardio_fit_data['Income'] = np.clip(cardio_fit_data['Income'], lower_whisker, upper_whisker)
#sns.boxplot(data=cardio_fit_data,x='Income')
#plt.show()
As the outliers of this dataset look valid and continuous, we do not necessarily need to treat them; removing genuine values would also distort the analysis. The method above shows how outliers could be capped in case we do find any that need treatment.
We analyzed a dataset of three different treadmill products and the customers who bought them. The products were purchased by customers aged 18 to 50, both single and partnered. The main features of interest here are the Miles and Fitness values (Fitness is self-rated). It is good to see data related to health and fitness, which matters a great deal these days.
We have been able to conclude that -
Make your body into your most beautiful outfit! Happy Sweating with TreadMills!! Fitness Matters!!!